DOCUMENT RESUME

ED 218 317                                              TM 820 357

AUTHOR         Quellmalz, Edys
TITLE          Problems in Stabilizing the Judgment Process.
INSTITUTION    California Univ., Los Angeles. Center for the Study
               of Evaluation.
SPONS AGENCY   National Inst. of Education (ED), Washington, DC.
REPORT NO      CSE-R-136
PUB DATE       80
NOTE           27p.
EDRS PRICE     MF01/PC02 Plus Postage.
DESCRIPTORS    *Measurement Techniques; *Scoring; *Testing Problems;
               Test Reliability; Validity; Writing (Composition);
               *Writing Evaluation; *Writing Skills



ABSTRACT

Measurement problems which jeopardize the reliability
and validity of competency-based writing assessments are analyzed.
Methods to stabilize rating criteria and readers' application of them
are necessary. Most writing assessment programs use guidelines from
norm-referenced test methodology. Use of this method of criteria
application, based on ranking within occasion, endangers or precludes
between-occasion uniformity. Judgment stability within a session and
across sessions are further measurement problems. The instability of
ratings has been a major weakness of writing skills measures. Two
indicators of rating variability are discussed. Rater drift is the
rater's progressive deviation within a scoring session from
previously shared criteria. Scale instability is the differential
application of criteria by raters in different scoring sessions.
During scale development and validation, assessments should collect
separate ratings on component text features that comprise a total
score. Rating methods should intersperse periodic checks. Frequency
of checks and nature of feedback on scoring accuracy are important.
Sound writing assessment requires scale stability. (Author/DWH)




***********************************************************************
Reproductions supplied by EDRS are the best that can be made
from the original document.
***********************************************************************




PROBLEMS IN STABILIZING THE JUDGMENT PROCESS
Edys Quellmalz

CSE Report No. 136
1980

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles



The project presented or reported herein was performed
pursuant to a grant from the National Institute of
Education, Department of Education. However, the opinions
expressed herein do not necessarily reflect the position
or policy of the National Institute of Education, and no
official endorsement by the National Institute of Education
should be inferred.



Abstract

Problems in Stabilizing the Judgment Process

This article analyzes a series of measurement problems that jeopardize
the reliability and validity of competency-based writing assessments.
The paper distinguishes between two indicators of rating variability:
1) rater drift, a rater's progressive deviation within a scoring session
from previously shared criteria; and 2) scale instability, the differential
application of criteria by raters in different scoring sessions. Examples
from research illustrate the nature and magnitude of rating fluctuations.
Promising techniques are described for stabilizing raters' judgments and
documenting scale stability.



Problems in Stabilizing the Judgment Process

Edys Quellmalz

Center for the Study of Evaluation
University of California, Los Angeles



The increasing demand for competency assessments of complex human
performance has led to renewed scrutiny of the conceptual and technical
quality of prevailing testing practice. Particularly in the area of
language production, i.e., writing, oral language and oral reading,
researchers and practitioners assert that competency tests must provide
tasks that match performance objectives and that activate cognitive
processing strategies required by production rather than recognition
tasks. The validity of indirect (i.e., multiple choice) measures is no
longer logically, psychologically or ecologically acceptable to the
majority of professionals in writing instruction and evaluation. Life
is not a multiple choice. Students' language production skills, in
particular, must be sufficiently proficient for students to function
autonomously in the real world.

Although collecting samples of complex performance can presumably
provide "direct," valid measures of content, the renowned unreliability
of judging constructed responses continues to plague assessment
methodology. Because direct performance samples are mediated by highly
variable judgments of raters who score or characterize performance samples
along some dimensions, a critical goal for performance judgment in
general, and for writing judgment in particular, is to find ways to assure
that judges apply scoring criteria accurately and fairly. As part of
a broader program studying issues in test design, we have investigated
dimensions of the test tasks, context and scoring that will reduce
irrelevant variability in examinee and rater behavior.

This paper analyzes a series of measurement problems that jeopardize
the validity of the judgment process and examines the effectiveness of
methods currently employed to address these problems. Reviews of
prevailing rating practices, in conjunction with cumulative empirical
evidence on factors influencing judgments in domain-referenced assessment,
demonstrate that direct writing assessment faces a dual validity requirement.
Both the test task and the scoring procedure must meet separate conceptual
and statistical validity standards. The paper elaborates the requirements
for accurate and fair writing competence assessment and illustrates how
state-of-the-art rating processes pose serious threats to the validity of
the writing assessments.

Domain-Referenced Scoring Requirements

The avowed intent and structure of competency or domain-referenced
tests require explicit, replicable scoring criteria and procedures; thus,
the need for methods to stabilize rating criteria and readers' application
of them is immediate and real. Soon the uniform application of performance
criteria may become a legal requirement when decisions based on these tests
result in life-altering consequences for students. Mandates proliferate
at state and local levels for writing assessment at all levels of public
school, and large numbers of writing samples must be scored by great
numbers of raters. Many assessment programs are required to provide
students repeated opportunities to pass comparable forms of a test. Also
built into many assessment programs is a requirement to administer
comparable tests, at regular intervals, at geographically separate sites.

The purpose of these competency assessments is to monitor
development of students' skills at points specified throughout their schooling,
to detect skills for which they might need remedial assistance and to
document skill development. A student who fails to demonstrate competency
in writing, receives additional instruction, and is then retested should
be judged according to the same standards at each test administration.
His or her score should depend neither on the performance of a new
cohort of examinees nor upon the idiosyncratic values of differently
oriented sets of raters.

Unfortunately, many writing assessment programs derive their
guidelines from norm-referenced test methodology. In practice, norm-referenced
writing tests are scored by ranking papers within the limits of a
particular sample. Essays are usually scored holistically, on generally
described criteria, and involve scoring procedures where raters rank essays
by sorting them into piles anchored by the range of quality of that
particular sample (Conlan, 1976). Thus a particular paper's rank and/or
score could change from sample to sample, if the range of the quality of
the competition varied from one test group to the next. Such practices
result in a "sliding scale" where the rated quality of a particular paper
changes according to the quality range of papers in the group. For example,
a student might take a writing competency test in the fall, when all
students, low achievers to college preparatory students, participate. The
student's rank in this wide quality range is below mastery. In the spring,
the student, along with the restricted range of students who failed the
first administration, takes another writing competency test and just passes.
Does s/he pass because intervening writing instruction has strengthened
weak writing skills, or because her or his rank is higher in the restricted
range of poorer writers? Present holistic scoring procedures cannot
provide an answer to this question. The holistic score provides no evidence
of the developmental level of the specific writing skills that were weak and
may have improved. Despite the use of "anchor" papers during training to
illustrate what a "6" or "3" had been for other groups, the most prevalent
holistic scoring procedures still require raters to distribute papers
across the score range.

A major measurement problem confronting many competency-based writing
assessments, then, is the failure to deal with the need to assure
comparability of scoring between test occasions as well as within a scoring
session. Such comparability would require not just statistical indices
of rater agreement but comparisons of mean scores, since ratings within
a session might agree but differ between sessions. Adopting a
norm-referenced method of criteria application based on ranking within occasion
imperils, if not precludes, between-occasion uniformity of criteria
application. Therefore two measurement problems inhere in judgment
stability: stability within a session and stability across sessions.



To document scale stability, an assessment would have to intersperse
anchor papers scored in previous assessments among papers rated within
an ongoing rating session and report comparability of anchor paper scores
across test occasions and rater groups. Such documentation of
comparability is conspicuously absent in both research and practice.
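
The anchor-paper comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the scores are hypothetical and do not come from the studies reported here. The sketch shows why agreement indices alone are insufficient: two occasions can rank the anchor papers identically (correlation of 1) while the second occasion scores every paper a full point lower.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def anchor_comparison(previous, current):
    """Compare scores given to the same anchor papers on two occasions."""
    return {
        "r": pearson_r(previous, current),
        "mean_shift": mean(current) - mean(previous),
    }


# Hypothetical scores for seven anchor papers (illustrative only).
previous_scores = [2, 3, 4, 5, 6, 3, 4]
current_scores = [s - 1 for s in previous_scores]  # same ranking, stricter scale

result = anchor_comparison(previous_scores, current_scores)
print(result)
```

A report of scale stability along these lines would give both numbers for every occasion, since a high correlation paired with a nonzero mean shift is exactly the "sliding scale" pattern.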

Research on Rating Variability

Evidence pointing to the sources and manifestations of scale instability
can be found in the rapidly accumulating body of research on issues of
rating variability. The instability of ratings has been a major and
generally acknowledged weakness of measures of writing skill (Coffman, 1971b).
Braddock, Lloyd-Jones and Schoer (1963) classified four sources of error:
1) the writer, 2) the assignment, 3) the rater, 4) between raters. Although
considerable research within the framework of domain-referenced testing
has examined dimensions of the test task that influence writer performance,
such as discourse aim and topic modality (Pitts, 1978; Spooner-Smith, 1978;
Quellmalz, 1979; Praeter & Padia, 1980; Crowhurst, 1980), less attention
has been given to the factors involved in rater behavior.

In the broadest sense, inter- and intra-rater variability are a matter
of fluctuating standards of judgment. Research has amply demonstrated
that anarchical scoring of essays, where raters apply their individual
standards, results in high disagreement among raters from different
occupations (Diederich, French, & Carlton, 1961) and even among English
professors (Findlayson, 1951; McColly, 1970). Follman and Anderson (1967)
demonstrated that the more homogeneous the background of raters, the more
their scoring agreed. Long ago, Eels (1930) demonstrated the problem
of intra-rater criteria bias when he found that the variability in essay
scores assigned even by the same reader on different occasions approached
the degree of variability of scores assigned by different readers.
Recognizing the magnitude of error occurring in unstructured scoring, researchers
attempted to devise various techniques for controlling score variability.

Methods for Controlling Scoring Variability

The first and most critical step in stabilizing the bases of readers'
judgments is to establish common, explicit scoring criteria. Criteria
may either be specified deductively by invoking standards derived from
the rhetorical tradition (e.g., Kinneavy, 1971) or inductively by
seeking commonality among readers' comments on papers (Diederich, 1974;
Freedman, 1978). Systematic training on common scoring criteria has
proved to reduce some kinds of inter-rater variability effectively
(Stalnaker, 1934; Diederich, 1974). As a result of these pioneering
studies, standard methodology now includes training of raters on the use
of rating scales until a high level of agreement among raters is achieved.

In a recent study of the discriminative validity of alternative scoring
rubrics, Winters (1978) suggested that high rater reliability coefficients
in pilot or in final rating sessions might not necessarily signal standard,
uniform interpretation of rating scales over rating occasions and across
rater groups. During rater training she observed that less operationalized
scale rubrics stimulated extensive discussion and interpretation and
suggested that different rater groups might achieve high reliability, but
have interpreted vague criteria differently by devising different specific
decision rules for the same ambiguous criteria. Thus, high reliability
coefficients might be obtained, but at the cost of accurate, replicable
scoring. As Winters implies, redefinition of criteria by the social rating
group can have serious implications for the fairness of ratings across
rater groups.

Rater Drift

Even with training for rater consensus, when raters practice applying
explicit criteria, rating fluctuation may still occur. The deviation of
raters from previously shared criteria is termed "rater drift" and may be
signaled by lowered inter-rater reliability and differences between raters'
criteria interpretation and expert-generated criterion-based ratings.
Rater drift is particularly a problem when there are large sets of
papers to be scored. Shifting criteria or drift may be caused by rater
fatigue, or by more systematic influences, such as the quality range of
the sample of papers being read or idiosyncratically valued criteria.

In a description of the rater as a source of error, Braddock et al.
(1963) discussed the need for controlling for rater fatigue. They cited
fatigue as a cause for raters to become severe or erratic in their
evaluation or to place more weight on particularly noticeable essay elements
such as mechanics. Godshalk, Swineford, and Coffman (1966) found
significant differences between papers scored holistically early and later in a
set of 646 papers. Coffman (1971b) warned that even when two sets of scores
derive from changing combinations of raters, "there may still be
differences in the means and standard deviations attributable to order effects
--that is, the tendency of groups of raters to shift their standards as
the reading proceeds" (p. 276). Coffman (1971a) also discussed raters'
tendency to regress to their own internalized set of standards and
recommended practice on common criteria.

Rater drift impairs the technical quality of rating results by
reducing inter- and intra-rater reliability and, more importantly, compromises
the validity of ratings. However, writing assessment programs do not seem
to acknowledge rater drift as a validity problem, nor do they deal with
rater drift directly.

State-of-the-Art Procedures for Treating Scoring Variability

Current rating procedures (Conlan, 1976; Office of the Los Angeles
Superintendent of Schools, 1977) generally follow methods recommended by
Braddock et al. (1963) and Coffman (1971a) and have evolved a number
of methods to deal with rater variability. Typically, raters begin by
practicing applying a rubric to a sample set of papers. The nature and
relative specificity of scale criteria and scoring formats (holistic vs.
analytic) vary, as do the weights of component criteria. Before independent
rating begins, trainers conduct a reliability check. Sometimes consensus
is checked statistically; sometimes it is indicated by a show of hands.

During independent ratings, methods for dealing with rater agreement
tend to take two tacks: correction and maintenance. Procedures which
emphasize correction use post hoc methods to treat score discrepancies.
Common options are: 1) having a third reader score any paper where the
first readers disagree by more than one point; 2) using the sum of two
ratings as a total score; 3) randomizing the order in which two raters
score an essay in order to distribute rater error, although often the
randomization occurs in a single day. These post hoc correction
procedures sidestep the validity problem of the changing criteria employed
by the drifting reader.
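
The first two correction options can be expressed as a simple rule. The sketch below is a minimal illustration; the function name and the `max_gap` parameter are hypothetical. It deliberately does not say how a third reading would be combined with the first two, because the procedures described above leave that open.

```python
def combine_two_readings(first, second, max_gap=1):
    """Return (summed score, whether a third reader is needed).

    A third reader is called for when the two readings differ by
    more than `max_gap` points (one point in the option above).
    """
    needs_third_reader = abs(first - second) > max_gap
    return first + second, needs_third_reader


print(combine_two_readings(4, 5))  # adjacent scores: sum stands
print(combine_two_readings(2, 5))  # three points apart: flag for a third reader
```

Note that the rule only repairs the recorded score. It gives the drifting reader no information about why the discrepancy arose, which is the limitation noted above.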

A second set of procedures for dealing with rating variability aims
at maintenance of scoring accuracy. Periodic consensus checks on
identical papers are interspersed at varying intervals. Checks may be common
to all raters and discussed in the group, discussed within rater pairs or
discussed with a "master" rater. In the procedure, discrepancies are
called to the rater's attention and their bases revised. These
maintenance procedures at least attempt to prevent, detect, and control
scoring error by providing feedback to individual raters regarding the accuracy
and consistency of their scoring decision rules.
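
A periodic check of this kind amounts to comparing each rater's scores on the interspersed common papers with a reference score. The sketch below assumes the reference is a "master" rater's score and that any difference beyond a chosen tolerance is fed back to the rater; the names, the tolerance, and the scores are all hypothetical.

```python
def check_discrepancies(rater_scores, reference_scores, tolerance=0):
    """List the check papers on which a rater departs from the reference.

    Both arguments map a check-paper label to a score. A paper is
    reported when the absolute difference exceeds `tolerance`.
    """
    return [
        paper
        for paper, score in rater_scores.items()
        if abs(score - reference_scores[paper]) > tolerance
    ]


# Hypothetical check papers interspersed during one session.
reference = {"check_1": 3, "check_2": 4, "check_3": 2}
rater = {"check_1": 3, "check_2": 2, "check_3": 2}

flagged = check_discrepancies(rater, reference)
print(flagged)
```

How often such checks occur and what feedback follows a flagged paper are the design choices that the studies reported below varied.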

Rating Variability in Competency Assessment Research

In a series of studies examining dimensions important in the
formulation of valid, instructionally sensitive writing assessments, we
documented the effects of several different procedures for attaining and
maintaining rater congruence and fidelity to the rating scale. One
component of the methodology was to develop analytic scoring rubrics referenced
to basic structural features of a discourse mode. Explicit criteria were
designed to reference operational, instructionally manipulatable elements
of the paper. Raters practiced applying the scoring rubric in intensive
training sessions, and reliability checks using generalizability statistics
were calculated to assure inter-rater reliability. During final,
independent ratings, common checks occurred at frequent intervals. Discrepancy
resolution procedures were of several types, including group discussion
or pair discussion. The research focus of these studies was on variations
of the tasks of writing rather than on variables influencing the rating
process, yet the accumulating data indicated that stabilizing the
judgment process was a complex issue--one deserving direct experimental
investigation. This conclusion derived primarily from three of our studies
in which we observed rater drift surface as a problem, despite the
different procedures used to prevent it. We also began to inspect indices of
scale stability by looking at scores given by raters trained at different
times to the same set of papers.

Rater Drift

In our writing assessment research our initial scoring concerns were
to establish and maintain rater agreement. To determine that this occurred,
we compared reliabilities obtained immediately after training (on a pilot
test of independent ratings) and after the final ratings. Table 1 presents
a comparison of generalizability coefficients marking rater agreement
levels on pilot and final ratings.



                                 Table 1

    Comparison of Generalizability Coefficients for Rater Agreement
          Immediately After Training and After Final Ratings

Study 1 - Expository Scale I (Spooner-Smith, 1978)

                              F     Dev   O     Su    Pa    M     Total
                              GC    GC    GC    GC    GC    GC    GC
Pilot - 4 raters  (n=15)      .94   .92   .94   .83   .94   .80   .90
Final - 2 ratings (n=112)     .84   .80   .85   .85   .80   .95   .90

Study 2 - Expository Scale II (Quellmalz and Capell, 1979)

                              GI    F     O     S     M     Total
                              GC    GC    GC    GC    GC    GC
Pilot - 4 raters              .74   .63   .74   .77   .73   [?]
Final - 2 ratings             .67   .59   .61   .57   .52   .66

Study 2 - Narrative Scale II

                              GI    F     O     S     M     Total
                              GC    GC    GC    GC    GC    GC
Pilot - 4 raters              .86   .76   .79   .76   .52   [?]
Final - 2 ratings             .84   .60   .72   .72   .69   .83

Study 3 - Expository Scale III (Baker and Quellmalz, 1980)

                              GI    Gen   Coh   Po    Su    M     Total
                                    Comp
                              GC    GC    GC    GC    GC    GC    GC
Pilot - 3 raters              .74   .65   .86   .93   .84   .71   .89
Final - 2 ratings             .66   .71   .62   .83   .71   .76   .81

Study 3 - Narrative Scale III

                              GI    Gen   Coh   Po    Su    M     Total
                                    Comp
                              GC    GC    GC    GC    GC    GC    GC
Pilot - 3 raters              [?]   .75   .62   .87   .54   .85   .79
Final - 2 ratings             .70   .76   .53   .87   .67   .68   .81

KEY

GC = Generalizability Coefficient

Study 1 (Spooner-       Study 2 (Quellmalz and     Study 3 (Baker and
Smith, 1978)            Capell, 1979)              Quellmalz, 1980)

F     = Focus           GI    = General Impression GI       = General Impression
Dev   = Development     F     = Focus              Gen Comp = General Competency
O     = Organization    O     = Organization       Coh      = Coherence
Su    = Support         S     = Support            Po       = Paragraph Organization
Pa    = Paragraphing    M     = Mechanics          Su       = Support
M     = Mechanics       Total                      M        = Mechanics
Total                                              Total

Note. Entries marked [?] are not legible in the source. Where a row
of the source shows fewer values than columns, the values are entered
in the order in which they appear.

The first rating procedure was employed in Study 1, where Spooner-Smith
(1978) compared direct and indirect measures of writing competence. Four
raters received five hours' practice applying an analytic rubric,
Expository Scale I, to a set of papers representative of the experimental set.
The top table presents Spooner-Smith's inter-rater reliabilities for four
raters on the pilot test conducted immediately after training and on the
final independent ratings of the experimental papers. During the final
independent scoring, raters read, rated and discussed discrepancies on
a common paper as a group approximately every hour to check adherence to
criteria. While the total score reliability on the final ratings remained
high, reliabilities of four of the six subscales dropped as much as .14,
indicating some degree of rater drift from original consensus levels.

The second rating procedure occurred in Study 2 (Quellmalz & Capell,
1979), which compared writing performance in different discourse and
response modes. Following scale training procedures employed by Spooner-
Smith (1978), pilot tests of inter-rater reliabilities for two revised
analytic rubrics, Expository Scale II and Narrative Scale II, checked the
level of agreement of the four raters prior to final rating. Additional
training occurred on any subscale where the generalizability coefficient
was less than .70. During final scoring, rater pairs read and discussed
common papers after every 20 independent ratings. The two tables for
Study 2 indicate, again, that agreement levels on the total scores were
acceptably high, but that reliabilities on three of the expository
subscales deteriorated as much as .20. The interpretation of these data
was that the frequency and nature of the common check procedures were still
not curbing rater drift adequately.



Consequently, Study 3 implemented a revised rating procedure. Study
3 (Baker & Quellmalz, 1980) investigated the effect of modality of topic
presentation on eighth grade writing performance. Three raters
participated in scale training for analytic Expository Scale III and Narrative
Scale III. Following a pilot test of inter-rater reliability, the three
raters independently scored the experimental papers. Each paper received
two ratings. Common checks occurred every hour and were discussed by the
entire group.

As the two tables for Study 3 indicate, agreement levels fell on
General Impression, but not on the General Competency rating.
Reliabilities plummeted on the expository Coherence ratings and on the Mechanics
ratings of the narrative scale. These comparisons of pilot and final
reliabilities for Study 3 suggested that the revised checking procedure was
generally maintaining rater agreement, but still did not prevent drift on
some subscales.
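
The pilot-to-final comparison used in these studies can be made mechanical. The sketch below takes the Study 3 Expository Scale III coefficients as read from Table 1 and flags any subscale whose coefficient fell by at least a chosen amount. The .15 cutoff and the function names are assumptions made for illustration; the studies do not state a numerical criterion for drift.

```python
# Study 3, Expository Scale III generalizability coefficients (Table 1).
PILOT = {"GI": 0.74, "Gen Comp": 0.65, "Coh": 0.86, "Po": 0.93,
         "Su": 0.84, "M": 0.71, "Total": 0.89}
FINAL = {"GI": 0.66, "Gen Comp": 0.71, "Coh": 0.62, "Po": 0.83,
         "Su": 0.71, "M": 0.76, "Total": 0.81}


def reliability_drops(pilot, final):
    """Pilot minus final coefficient per subscale (positive = decline)."""
    return {name: round(pilot[name] - final[name], 2) for name in pilot}


def flag_drift(pilot, final, min_drop=0.15):
    """Subscales whose reliability fell by at least `min_drop` (hypothetical cutoff)."""
    drops = reliability_drops(pilot, final)
    return [name for name, drop in drops.items() if drop >= min_drop]


print(reliability_drops(PILOT, FINAL))
print(flag_drift(PILOT, FINAL))
```

With this cutoff only Coherence is flagged, which agrees with the reading of the table given above; a looser cutoff would also pick up the smaller declines on Support and Paragraph Organization.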

In a more detailed inspection of the emergence of rater drift in Study 3,
we also compared reliabilities and mean scores on papers scored early
and late in the rating sequence (see Table 2). Table 2 presents the early
vs. late comparisons for Expository Scale III and Narrative Scale III.
On the expository scale, reliabilities across all rater pairs remain high
(a .76 to .85) except on the General Impression and Coherence subscales.
Parametric comparisons of mean scores on early vs. late papers did not
reach statistical significance, but late scored papers received slightly
higher ratings than early scored papers.

                                 Table 2

        Comparison of Early vs. Late Scored Papers in Study 3
                     (Baker and Quellmalz, 1980)

Expository Scale III

                        Inter-rater
                        Reliabilities (a)    Mean Scores (S.D.)
                        Early    Late        Early            Late            t

General Impression      .85      .69         2.28 (1.07)      2.29 (.85)      .97
General Competency      .75      .77         2.20 (.91)       2.43 (.86)      .23
Coherence               .78      .57         2.39 (.88)       2.63 (.90)      .21
Paragraph Organization  .87      .86         2.03 (1.05)      2.22 (1.08)     .40
Support                 .78      .76         [?].99 (.85)     3.11 (.90)      .51
Mechanics               .67      .82         2.18[?] (.85)    2.99 (.76)      1.08
Total                   .87[?]   .85         14.78 (4.86)     15.89 (4.49)    1.06
n                       40       [?]

Narrative Scale III

                        Inter-rater
                        Reliabilities (a)    Mean Scores (S.D.)
                        Early    Late        Early            Late            t

General Impression      .78      .71         2.62 (1.12[?])   2.19[?] (.73)   2.31*
General Competence      .81      .78         2.54 (.87)       2.20 (.78)      1.84
Coherence               .77      .46         2.60 (.99)       2.31 (.59)      1.60
Paragraph Organization  .93      .85         [?] (1.2[?])     2.03 ([?])      .74
Support                 .84      .84         2.82 (.97)       2.51 (.68)      1.6[?]
Mechanics               .68      .80         2.30 (.80)       2.16 (.74)      .82
Total                   .90      .86         14.35 (4.94)     13.03 ([?])     1.49
n                       40       50

a = alpha coefficient
* p < .05

Note. Entries marked [?] are not fully legible in the source; t values
are listed in the order in which they appear there.

Reliabilities on Narrative Scale III remained high on General
Competence, Support, Mechanics and Total score. General Impression reliability
dropped .08, Coherence dropped substantially (a .77 to .46) and Paragraph
Organization fell (a .93 to .85). Contrasts of mean differences between
early and late scored narrative papers revealed a significant difference
on General Impression ratings. Papers scored later received lower ratings
than those scored earlier. All subscale scores were lower for late scored
papers. These findings are consistent with other research (Godshalk et al.,
1966) that reported raters became more severe as scoring progressed. In
Study 3, Expository papers were scored before Narrative papers, so late
scored Narrative papers were at the very end of the entire scoring sequence.

Inspection of the scoring data from the three studies suggests that
rater drift within a scoring session can occur and weaken scoring rigor.
Raters' judgments wavered on some subscales more than others, signalling
a need for more careful explication of criteria on those subscales and
practice on their application. Since state-of-the-art procedures for
controlling rater drift were employed and even refined in these studies,
the data implied the need to continue to examine methodologies of
detecting and preventing rater drift.
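
One simple way to look for drift within a session is the early vs. late split used in Study 3: order the scores by reading sequence, divide them at some point, and compare the means of the two parts. The sketch below is a minimal version of that idea. The function name, the split point, and the scores are hypothetical, and a real analysis would add a significance test and the reliability comparison.

```python
from statistics import mean


def early_late_summary(scores_in_reading_order, n_early):
    """Mean score of the first `n_early` papers versus the remainder."""
    early = scores_in_reading_order[:n_early]
    late = scores_in_reading_order[n_early:]
    return {
        "early_mean": mean(early),
        "late_mean": mean(late),
        "shift": mean(late) - mean(early),  # negative = raters grew more severe
    }


# Hypothetical scores on one subscale, in the order the papers were read.
scores = [3, 4, 3, 4, 2, 3, 2, 3]
summary = early_late_summary(scores, n_early=4)
print(summary)
```

A negative shift of this kind would correspond to the pattern seen on the narrative papers, where later papers received lower ratings, though it cannot by itself separate rater severity from a real difference in the quality of the late papers.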



Scale Stability

A validity concern coordinate with maintenance of scale fidelity
within rating occasion is assurance of judgment accuracy across rating occasions.
Standards of fairness and methodological rigor mandate that criteria apply
uniformly across sets of raters and sets of papers.

Prevailing practice does not seem to recognize stability as a technical
problem. Large scale assessments do not routinely report and inspect a
series of rater reliabilities for separate scoring sessions. Even
reliability indices are not sufficient, however. Comparisons of mean scores
on common papers should supplement reliability statistics. Scale stability
could be demonstrated by comparing scores on a common set of papers given
by different rater sets trained separately, or by comparing scores from
the same raters rating at different occasions. While we have not yet
investigated this phenomenon within an experimental paradigm, we have
inspected scoring data gathered during the process of our other
writing assessment research in an attempt to understand the nature of variables
influencing scale stability.

Our Table 3 presents -the means and standard deviations of essay scores 
given b^ two different rater sets_ to the same papers. Raters A and B 
scored 30 expository essays. Rate>2 pairs 1, 2 and 3 rated these same 30 
essays in the course of Study 3. Rater pairs 1, 2 and 3 were using Ex- 
^pository Scale III, a revision of the, analytic expository rating scale 
used' by Raters A and B. Therefore only scores from those subscales that 
were, not' significantly changed were entered into the analysis. Agreement' 
levels were not calculated due to the smalKsample size. 

Inspection of the means reveals that Raters A and B gave generally
higher ratings than Rater pairs 1, 2 and 3. Comparisons of means for
each subscale and the total score were all significant. While the small
number of papers clearly limits interpretation of these data, they do
document that criteria definition and application did change from one
rating session to the next.
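
The report does not state which test produced the t values in Table 3. A
minimal sketch, assuming a pooled-variance two-sample t-test (an assumption,
not something the source confirms), recomputes the statistic from the
summary figures in the Total row. The function name is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df


# Total row of Table 3: Occasion 1 (Raters A and B) vs. Occasion 2 (Raters 1-6).
t, df = pooled_t(11.81, 3.17, 29, 8.97, 1.76, 32)
print(f"t = {t:.2f}, df = {df}")  # t = 4.38, df = 59
```

Under that assumption the result agrees with the t of 4.38 on 59 degrees
of freedom reported for the total score.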






Table 3

Comparison of Essay Scores Given by Different Rater Sets
on Separate Occasions

                                        Ratings
                               Occasion 1      Occasion 2
Subscales                    Raters A and B    Raters 1-6**      t       df

General Competence    Mean        2.92           [?].65        2.77*     67
                      s.d.         .62             .38
                      n            [?]             [?]

Paragraph             Mean        2.19             [?]         3.67*     58
Organization          s.d.         .98             .50
                      n             29              31

Support               Mean        2.76            2.[?]        3.25*     59
                      s.d.        1.08             .50
                      n             29              32

Total                 Mean       11.81            8.97         4.38*     59
                      s.d.        3.17            1.76
                      n             29              32

** Scores by rater pairs 1-6 were transformed from a score range of 1-6 to
   1-4 to permit analyses.






In addition to looking at the scores different raters trained at
separate occasions gave to the same set of papers, we inspected
intra-rater agreement of scores a pair of raters gave to common papers
scored at different sessions. Table 4 displays means and standard
deviations of a rater pair (N) which participated in two different rating
sessions.

Insert Table 4 here



In Study 1, rater pairs M and N scored essays from a general high school
population which were then "salted in" a set of college admission essays
read for Study 2. In Study 2, pair N read the eight essays they had scored
previously in Study 1 and 8 additional essays from that study that they
had not personally scored. The means of pair N in the two studies are
fairly comparable except on Support and Mechanics. In contrast, the means
of pairs M and O are substantially different. Pair O means are consistently
lower. The greater stability of means for pair N may suggest that they
were applying criteria in a uniform manner. Pair O was probably influenced
by the overall higher quality of the college admissions sample, thus making
the "salted in" general population high school essays seem worse. Methods
for eliminating this subtle "norming" of presumably explicit criteria to
the quality range of a particular sample require further research.
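
The comparison of pair N across the two studies can be sketched as a simple
screen on the subscale means reported in Table 4. The cutoff of 0.40 and
the function name are hypothetical, chosen only to illustrate the idea; the
report proposes no numerical criterion.

```python
# Means for rater pair N, taken from Table 4.
SUBSCALES = ["General Impression", "Focus", "Organization", "Support", "Mechanics"]
study1 = [1.28, 1.71, 1.65, 2.76, 2.20]
study2 = [1.00, 1.69, 1.72, 2.00, 1.78]

THRESHOLD = 0.40  # hypothetical cutoff, for illustration only


def flag_shifts(names, before, after, threshold):
    """Return (name, shift) for subscales whose mean moved by at least threshold."""
    shifts = [(n, round(a - b, 2)) for n, b, a in zip(names, before, after)]
    return [(n, d) for n, d in shifts if abs(d) >= threshold]


flagged = flag_shifts(SUBSCALES, study1, study2, THRESHOLD)
for name, shift in flagged:
    print(f"{name}: {shift:+.2f}")
```

With this cutoff only Support and Mechanics are flagged, which matches the
two subscales singled out in the text.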

Our intent in inspecting these admittedly limited data was to
illustrate one method for tracking the stability of rating scale application.
Writing assessments could systematically include a "check" set of papers
in each rating session to document the comparability of judges' decision



TABLE 4

Comparison of Rater Pair Scores Across Studies

                                   Study 1              Study 2
                                  Rater Pair           Rater Pair
CSE Subscale                     M        N           N        O

General Impression    Mean     1.92     1.28        1.00      .94
                      s.d.     1.32     1.37        1.13      [?]
                      n           6        8          16      [?]

Focus                 Mean     2.08     1.71        1.69     1.53
                      s.d.      .38      .9          .48      .50
                      n           6        8          16       16

Organization          Mean     2.33     1.65        1.72     1.38
                      s.d.      .98      .6          .86      .50
                      n         [?]        8          16       16

Support               Mean     2.42     2.76        2.00     1.[?]3
                      s.d.      .92     1.15        1.78      .50
                      n         [?]        8          16       16

Mechanics             Mean     2.50     2.20        1.78     1.75
                      s.d.      .84      .70         .77      .58
                      n           6        8          16       16

Total                 Mean    11.25     9.60        8.19     7.21
                      s.d.     3.71     2.79        3.40     2.53
                      n           6        8          16       16



rules at different rating sessions. We believe that scale stability
across topics, quality range of papers and sets of raters can be achieved
and that the factors influencing scale stability require systematic
investigation.

Summary and Recommendations

The need for stabilizing the scoring process is critical to the
validity of writing assessments. Direct evidence of student writing
competence, actual written production, is a necessary condition for content
and construct validity; it is not sufficient, however. Raters' judgments
must be replicable and defensible. We believe that explicit rating
criteria are a condition for defensibility and replicability. Our rater
drift comparisons suggest that total scores and a holistic score seem to
mask fluctuations in judgments on the elements that contribute to the more
global summary scores. We suspect that, at least during scale development
and validation, assessments should collect separate ratings on component
text features such as Support and Coherence that contribute to a total
score. Otherwise, there is no way to identify and track consistency of
the bases for global judgments.
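
A small sketch shows how a total score can mask offsetting changes in its
components. The figures are invented for illustration and are not taken
from the rater drift comparisons; the subscale names follow those used in
this report.

```python
# Hypothetical subscale means for one rater early and late in a session.
early = {"Support": 3.0, "Coherence": 2.0, "Mechanics": 2.5}
late = {"Support": 2.0, "Coherence": 3.0, "Mechanics": 2.5}

# The total is unchanged even though two component judgments moved.
total_change = sum(late.values()) - sum(early.values())
subscale_change = {name: late[name] - early[name] for name in early}

print("change in total:", total_change)
print("change by subscale:", subscale_change)
```

Here the total does not move at all, while Support falls by a point and
Coherence rises by a point. Only the separate subscale ratings expose the
change.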

Certainly, scale training and an initial reliability check are
essential. Rather than relying primarily on randomization or statistical
procedures to correct for rater drift post hoc, rating methods should
intersperse periodic checks into lengthy, independent scoring. The variables
making these checks effective for maintaining agreement and scale fidelity
require further investigation. Frequency of checks is one important factor;
the nature of feedback on scoring accuracy is even more essential. We
are currently conducting research on methods for curbing rater drift.
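
One way of interspersing periodic checks can be sketched as follows. This
is an illustration under stated assumptions, not a procedure described in
this report: the interval of three papers, the use of pre-scored benchmark
papers, the exact-match agreement index, and all names are hypothetical.

```python
def build_queue(papers, check_papers, every):
    """Insert one check paper after each run of `every` operational papers."""
    queue, checks = [], iter(check_papers)
    for i, paper in enumerate(papers, start=1):
        queue.append(("operational", paper))
        if i % every == 0:
            check = next(checks, None)
            if check is not None:
                queue.append(("check", check))
    return queue


def check_agreement(rater_scores, benchmark_scores):
    """Proportion of check papers scored exactly as the benchmark."""
    hits = sum(rater_scores[p] == benchmark_scores[p] for p in benchmark_scores)
    return hits / len(benchmark_scores)


queue = build_queue([f"essay{i}" for i in range(1, 7)], ["checkA", "checkB"], every=3)
agreement = check_agreement({"checkA": 3, "checkB": 2}, {"checkA": 3, "checkB": 3})
print(queue)
print(f"agreement on check papers = {agreement:.2f}")
```

The `every` parameter corresponds to the frequency of checks. The agreement
figure is the kind of information that could be fed back to a rater during
the session, although what form of feedback works best remains an open
question.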

Scale stability is a critical validity issue for competency-based
writing assessment. Large scale assessments can, at least, document
stability by tracking scoring of a core set of papers by different groups
of raters. Methodologies for detecting and preventing scale instability
should also receive direct experimental attention. Fair, informative,
generalizable, defensible scoring procedures are necessary requirements
of sound writing assessment.






